The Crúbadán Project: Corpus building for under-resourced languages
Abstract
We present an overview of the Crúbadán project, the aim of which is the creation of text corpora for a large number of under-resourced languages by crawling the web.
Similar resources
Endangered Language Documentation: Bootstrapping a Chatino Speech Corpus, Forced Aligner, ASR
This project approaches the problem of language documentation and revitalization from a rather untraditional angle. To improve and facilitate language documentation of endangered languages, we attempt to use corpus linguistic methods and speech and language technologies to reduce the time needed for transcription and annotation of audio and video language recordings. The paper demonstrates this...
Cross-language F0 modeling for under-resourced tonal languages: a case study on Thai-Mandarin
This paper proposes a novel method for F0 modeling in under-resourced tonal languages. Conventional statistical models require large amounts of training data, which are lacking for many languages. In tonal languages, different syllabic tones are represented by different F0 shapes, some of which are similar across languages. With cross-language F0 contour mapping, we can augment the F0 model of one under-re...
An Iterative approach to extract dictionaries from Wikipedia for under-resourced languages
The problem of extracting bilingual dictionaries from Wikipedia is well known and well researched. Given the structural and rich multilingual content of Wikipedia, a language-independent approach is necessary for extracting dictionaries for various languages, all the more so for under-resourced languages. In our attempt to mine dictionaries for under-resourced languages, we developed an iterative approa...
DutchSemCor: Building a semantically annotated corpus for Dutch
State-of-the-art Word Sense Disambiguation (WSD) systems require large sense-tagged corpora along with lexical databases to reach satisfactory results. The number of English-language resources developed for WSD has increased in the past years, while most other languages are still under-resourced. The situation is no different for Dutch. In order to overcome this data bottleneck, the DutchSemCor pro...
Analysis and Evaluation of Comparable Corpora for Under Resourced Areas of Machine Translation
The lack of sufficient linguistic resources and parallel corpora for many languages and domains is currently one of the major obstacles to further advancement of automated translation. The solution proposed in this paper is to exploit the fact that non-parallel bi- or multilingual text resources are much more widely available than parallel translation data. This position paper presents previous resea...